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Abstract— Causal modeling has long been an attractive topic for many researchers and in recent decades there has 
seen a surge in theoretical development and discovery algorithms. Generally discovery algorithms can be divided into 
two approaches: constraint-based and score-based. The constraint-based approach is able to detect common causes 
of the observed variables but the use of independence tests makes it less reliable. The score-based approach produces 
a result that is easier to interpret as it also measures the reliability of the inferred causal relationships, but it is unable 
to detect common confounders of the observed variables. A drawback of both score-based and constrained-based 
approaches is the inherent instability in structure estimation. With finite samples small changes in the data can lead 
to completely different optimal structures. The present work introduces a new hypothesis-free score-based causal 
discovery algorithm, called stable specification search, that is robust for finite samples based on recent advances in 
stability selection using subsampling and selection algorithms. Structure search is performed over Structural Equation 
Models. Our approach uses exploratory search but allows incorporation of prior background knowledge. We validated 
our approach on one simulated data set, which we compare to the known ground truth, and two real-world data sets for 
Chronic Fatigue Syndrome and Attention Deficit Hyperactivity Disorder, which we compare to earlier medical studies. 
The results on the simulated data set show significant improvement over alternative approaches and the results on the 
real-word data sets show consistency with the hypothesis driven models constructed by medical experts. 

Index Terms— Causal modeling. Structural equation model. Stability selection. Multi-objective evolutionary algorithm, 
NSGA-II. 
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1 Introduction 

AUSAL modeling has been an attractive 
topic for many researchers for decades. 
Especially since the 1990s there has been 
an enormous increase in theoretical develop¬ 
ment, partly because of advances in graphi¬ 
cal modeling ||^. This has led to a variety of 
causal discovery algorithms in the literature. 
In general, causal discovery algorithms can 
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be divided into two approaches: constraint- 
based and score-based. Constraint-based ap¬ 
proaches work with conditional independence 
tests. First, they construct a skeleton graph 
starting with the complete graph and excluding 
edges between variables that are conditionally 
independent. Second, edges are oriented to ar¬ 
rive at a causal graph. Examples of consfraint- 
based approaches are the IC algorithm [|^|, 
PC-FCI Q, and TC ||^. Constraint-based ap¬ 
proaches do not have to rely on the causal 
sufficiency assumption, and then can detect 
common causes of the observed variables 
A disadvantage of this approach is the use 
of independence tests on a large number of 
conditioning variables, making it less reliable 
1^. Score-based approaches assign scores to 
particular graph structures based on the data fit 
and the complexity of the graph. Different scor- 
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ing metrics that are often used are the Bayesian 
score 0 and the BIC score 0. An example of a 
score-based method is greedy equivalence search 
(GES) 0. The goal of the score-based approach 
is to find the graph structure with the highest 
score. An advantage of this approach is that it 
measures the reliability of the inferred causal 
relationships, which makes the result easy to 
interpret | |10| ]. Score-based approaches typically 
do make the causal sufficiency assumption, and 
then cannot detect common confounders of 
the observed variables. Moreover, the involved 
optimization problem is usually NP-hard, so 
that different search heuristics are often used. 
The approach advocated in this paper is an 
example of a score-based approach. 

Furthermore, in causal modeling based on 
observational data, the causal models are unde¬ 
termined unless a preference for parsimonious 
models over more complex models is made 
0. In score-based approaches, such simplic¬ 
ity assumptions are typically implemented by 
adding a penalty for model complexity 0. 
Constraint-based approaches often make the 
implicit assumption of so-called causal faithful¬ 
ness 0, which states that there are no condi¬ 
tional independencies that hold in the density 
over a set of variables V, except those that 
are entailed by the causal structure. However, 
in practice faithfulness can be violated and 
better constrained-based approaches have been 
developed to handle this, such as CPC pTj] and 
ACPC @. 

A drawback of both score-based and 
constrained-based approaches, however, 
is the inherent instability in structure 
estimation. With finite samples small changes 
in the data can lead to completely different 
optimal structures. Outcomes of borderline 
independence tests can be incorrect and can 
lead to multiple errors when propagated by 
the discovery algorithm 0. 

The present work introduces a new score- 
based causal discovery algorithm, called stable 
specification search, that is robust for finite sam¬ 
ples based on advances in stability selection 
using subsampling and selection algorithms. 
Structure search is performed over Structural 
Equation Models (SEM), which is the most 
widely used language for causal discovery in 


various scientific disciplines. The method uses 
exploratory search, but allows incorporation of 
prior background knowledge. In order to show 
that our method can handle various kinds of 
data (continuous, discrete, and a combination 
of both) we evaluated our method on simu¬ 
lated and real-world data. The simulated data 
is used to compare our method with some 
advanced constrained-based approaches (PC- 
stable [131, CPC) and a score-based approach 
(GES). Specifically, we compare the robustness 
of each method in computing causal structure. 
The real-world data sets, about Chronic Fatigue 
Syndrome and Attention Deficit Hyperactiv¬ 
ity Disorder, are used to compare our results 
with some previous studies. The results show 
that our exploratory, hypothesis-free approach 
gives significant improvement over alternative 
approaches, and is able to obtain structure es¬ 
timates that are consistent with the hypothesis 
driven models constructed by medical experts 
based on medical data and years of experience. 

The rest of this paper is structured as follows. 
Section describes all the background material 
obtained from the existing literature. Section 
describes our robust score-based approach for 


^ presents experi- 
ated and two real- 


causal discovery. Section 
mental result on one simul 
world data sets. Section gives conclusions 
and suggestions for future work. 


2 Background 

2.1 Directed Acyclic Graph 

We first describe some graphical notation and 
terminology used in the remaining sections. A 
graph is a pair {V,E) with V a set of nodes 
and E a set of edges. A directed graph has all 
edges in E directed (arc); a single arrowhead 
on every edge, e.g., A ^ B. Directed cycles 
represent feedback or reciprocal relationships, 
e.g., A —)■ S —)■ A. A graph with no directed 
cycles is called acyclic. A graph which is both 
directed and acyclic is called a Directed Acyclic 
Graph (DAG) 0. Figure depicts a DAG of 
four variables. The skeleton of a DAG is the 
undirected graph that results from removing 
the directionality of every edge. A v-structure in 
a DAG G is an ordered triple (x, y, z) such that 
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G contains the directed edges x ^ y, z ^ y, 
and X and 2 are not adjacent in G |14|. 


2.2 Causal Modeling in SEM 

In this study, we focus on causal models with 
no reciprocal or feedback relationships, and 
no latent variables. Generally, there are two 
common ways of representing a model in SEM: 
by stating all relations in the set as equations, 
which is called a causal model, or by drawing 
them as a causal diagram (graph). The general 
form of the equations is 

Xi = i = l,...,n. (1) 

where pa^ denotes the parents which represent 
the set of variables considered to be direct 
causes of Xi and c, represents errors on account 
of omitted factors that are assumed to be mu¬ 
tually independent Q. 


2.3 Specification Search in SEM 

Typically, a SEM is used as follows: 1) set a 
hypothesis as the prior model, 2) fit the model 
to the data, 3) evaluate the model, and 4) mod¬ 
ify the model to improve the parsimony and 
score [ p3| . The last step is called specification 
search [16], | [T7| . This typical model refinement 
approach is hypothesis-driven. It works by 
adding or deleting some arcs between variables 
from the initial model. Typically only a few 
models are evaluated, making it difficult to 
derive causal relationships. 

An alternative approach is exploratory 
search in which no prior hypothesis is speci¬ 
fied. Typical approaches in the literature for ad¬ 
dressing the exponential search space include 
tabu search [l^j, genetic algorithms [[l^j, 120], 
ant colony optimization ||2T[j, and others [22], 


^ 



Eig. 1: A DAG of four variables. 


2.4 Multi-objective Optimization 

Eollowing the principle of Occam's razor, we 
should prefer models that are simple and fit the 
data well. These two objectives, however, are 
often conflicting as a well-fit model is likely to 
be a complex model. In this paper, we propose 
to make use of multi-objective optimization to 
explicitly optimize both objectives. 

In multi-objective optimization, optimal so¬ 
lutions are defined in terms of domination. A 
model xi is said to dominate model X 2 , if the 


following conditions are satisfied [25j: 


xi ^ X 2 iff 


Vze{l,...,M} /i(xi) </i(x2) 
3j e /,(xi) < /j(x 2 ) 


( 2 ) 

The first condition states that the model xi is 
no worse than X 2 in all objectives /,. The second 
condition states that the model xi is strictly 
better than X 2 in at least one objective. By using 
this concept, given the population of models 
P, we can partition P into n sets called fronts 
Fi,..., Fn, such that F^ dominates Fi where 
1 < k < I < n and the models within the 
same front do not dominate each other. The 
so-called Pareto Front or non-dominated set Fi 
includes models that are not dominated by any 
member of P. Essentially, using multi-objective 
optimization we efficiently find the best fitting 
models over a whole range of model com¬ 
plexities using a single coherent optimization 
approach. Figure provides a sketch. 



Fig. 2: Example of a population P partitioned 
into fronts Fi,... ,Fn when minimizing objec¬ 
tives fi and / 2 . Fi is the Pareto front not 
dominated by any member of P. 
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2.4.1 NSGA-II 

Non-dominated Sorting Genetic Algorithm II 
or NSGA-II l|2^ is a well-known multi¬ 
objective evolutionary algorithm (MOEA), still 
widely applied in various fields, such as image 
retrieval j^, reactive power planning |28|, 
building design and robot grippers I30|. 
A characteristic feature is fast non-dominated 
sorting which sorts models based on the con¬ 
cept of domination. With M the number of 
objectives and N the size of population, the 
time complexity has order O which is 

better than a naive approach with O {MN^). 
Another characteristic feature is crowding dis¬ 
tance sorting which is implemented to preserve 
the diversity among the solutions in the Pareto 
front. This feature sorts models based on the 
distance metric which explains the proximity 
of a model fo other models. 

The iterative procedure of NSGA-II shown 
in Figure is a sequence of sfeps started by 
generating a population of solutions P of size 
A. P is fhen manipulafed by genetic opera¬ 
tors such as selection, crossover, and mutation, 
forming a new population Q of size N. P 
and Q are fhen combined info population R 
with size 2N. After that R is sorted using 
fast non-dominated sorting, yielding a set of 
fronfs F. In fhe nexf iteration each front in F 
is sorted using the crowding distance sorting 
and the first N members are used to generate 
a new population P. At t = 0, P is formed by 
creating N random solutions sorted with fast 
non-dominated sorting. 


2.5 Stability Selection 

Structure estimation is a notoriously difficult 
problem, both because of compufational as- 
pecfs (finding the optimal structure can be NP 
hard) and because of instabilify (small changes 
in the data can lead to completely different op¬ 
timal structures). In this section we describe the 
method of ||M| for robust estimation of model 
strucfure based on subsampling in combina- 
fion with selection algorithms. The method has 
been shown to yield finite sample family wise 
error control and improved structure estimates. 

Let /She a sparse p-dimensional vector which 
generally represents, for example, fhe coeffi- 
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Fig. 3: Adopted from |26|. P is the current 
population with size N and is manipulated to 
make a new population Q. Both are combined, 
forming R, which will be sorfed using fast non- 
dominated yielding a set of fronts F. Every 
member of fronf ^ F will be assigned a so- 
called crowding disfance in order fo sorf 
The first N members of F will be selecfed fo 
be fhe nexf population P. 


cient vector in linear regression or the edges 
in a graph. In structure estimation the goal 
is to infer the set 5” = {k : /Sk f 0} of 
non-zero componenfs from noisy observations. 
Many methods tackle this problem by mini¬ 
mizing some loss function augmented with a 
regularization term to avoid overfitting. Usu¬ 
ally the regularization term is parameterized by 
A e A C M+ and each A leads to an estimated 
structure C p}. The objective is to 

determine A such that is identical to S with 
high probability. To this end, pT| introduces 
the concepts of selection probabilities and stability 
paths. 

Definition 1 (Selection probabilities). Let I be a 
subset of {1,... ,n} of size [n/2j randomly drawn 
without replacement, AT C {1,... ,p}, and S^{I) be 
the selected set for subsample I. The probability 
of K being in set S^{I) is 

= P C 

where the probability being is with respect to the 
random subsampling and possibly the construction 
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Definition 2 (Stability path). For each variable 
k = 1,... ,p the stability path is given by the 
selection probabilities {n^ : A e A}. 

Furthermore, in stability selection we do not 
select a single element from the set of models 
: A e A} as traditional methods do, but 
perturb the data many times and select struc¬ 
tures that occur in a large fraction of selected 
sets. To this end, Ij^ introduces the concept of 
stable variables. 

Definition 3 (Stable variables). The set of stable 
variables is defined as 

^stable = {A: : maxn^ > TTthr} 

AgA 

where TTthr is a cutoff with 0 < TTthr < 1- 

Variables with a high selection probability 
are kept whereas those with low selection prob¬ 
abilities are disregarded. The threshold TTthr is 
a tuning parameter but its influence is small 
and sensible values (e.g., TTthr G (0.6,0.9)) tend 
to give similar results. 


2.6 Model Equivalence 

There is one further subtlety that makes our 
approach for finding stable models (or sub¬ 
models) slightly more complicated than that in 
[311. If we find a particular model, we have to 
account for the fact that there may be different 
models that are observationally indistinguish¬ 
able. Causal models represented by DAGs have 
their corresponding model equivalent classes, 
called Completed Partially Directed Acyclic Graph 
(CPDAG). This means that every probability 
distribution derived from a model in a partic¬ 
ular CPDAG, can also be derived by models 
belonging to the same CPDAG. In SEMs, these 
models are called covariance equivalent Q. 

The characterization of equivalent structures 
is given by the following theorem 


Theorem 1. (Verma and Pearl, 1990) Two DAGs 
are equivalent if and only if they have the same 
skeletons and the same v-structures. 


Furthermore, a directed edge x —> y is com¬ 
pelled in Q if for every DAG Q' equivalent to Q, 
X —> y exists in Q. For any edge e in G, if e is not 
compelled in Q, then e is reversible. A CPDAG 


can be represented by a directed edge (arc) for 
every compelled edge and an undirected edge 
for every reversible edge p4| . 

Converting a model into a CPDAG allows 
one to observe the relations that hold among 
the variables. Arcs in a CPDAG indicate a 
cause-effect relation among variables since the 
same arc occurs in all members of the CPDAG. 
Undirected edges A — S in a CPDAG indicate 
that some members of the CPDAG contain an 
arc A ^ B whereas other members contain an 
arc B ^ A. 


3 Proposed method 
3.1 The General Idea 

Our proposed method can be divided into two 
phases. The first phase is search and the second 
phase is visualization. In the search phase SEM 
and NSGA-II are synergically combined for 
exploratory search of the model space. As por¬ 
trayed in Figure the inner loop is an iterative 
process, searching over the model space and 
returns a Pareto front of models. The outer loop 
is an iterative process that samples a different 
subset of the data in each iteration and at the 
end returns a number of Pareto fronts coming 
from those subsets. Each model returned by the 
outer loop is transformed into a CPDAG which 



Fig. 4: The proposed method consists of two 
phases: search and visualization. The search 
phase is an iterative process using an outer loop 
and inner loop that combines SEM, NSGA-II, 
and stability selection, which outputs all rel¬ 
evant edges and causal paths between two 
variables. The visualization phase displays the 
relevant relationships as a causal model. 
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are then used to compute the edge stability graph 
and the causal path stability graph. 

Definition 4. (Stability graphs) Let A and B 
be two variables and G a multiset (or bag) of 
CPDAGS. Let Gc be the submultiset of G con¬ 
taining all CPDAGS with complexity c. The edge 
stability for A and B at complexity c is the number 
of models in Gc for which there exists an edge 
between A and B (i.e., A ^ B, B ^ A, or A — B) 
divided by the total number of models in Gc- The 
causal path stability for A to B at complexity c is 
the number of models in Gc for which there is a 
directed path from A to B (of any length) divided 
by the total number of models in Gc- The terms edge 
stability graph and causal path stability graph are 
used to denote the corresponding measures for all 
variable pairs and all complexity levels. 


On top of the stability graphs we perform 
stability selection. In [311, stability selection is 
defined in terms of a regularization parameter 
A. In our approach we do not have a reg¬ 
ularization parameter and instead use model 
complexity (defined in Section |4.1| ) which is 
one of the objectives in our multi-objective op¬ 
timization approach. We therefore define two 
thresholds. The first threshold is the boundary 
of selection probability TTgei and corresponds 
to TTthr in ||^j. For example, setting TTgei = 0.6 
means that all causal relationships with edge 
stability or causal path stability (Figure ^ 
above this threshold are considered stable. The 
second threshold is the boundary of complexity 
TTbic, which is used to control overfitting and 
corresponds to minimal A in pT| . We set vTbic 
to the level of model complexity at which the 
minimum average Bayesian Information Criterion 
(BIC) score is found. For example, Tibic = 7 
means that all causal relationships with an 
edge stability or a causal path stability lower 
than this threshold (Figure are considered 
parsimonious. Causal relationships that intersect 
with the top-left region are considered both 
stable and parsimonious and called relevant. 

In the visualization phase we combine the 
stability graphs into a graph with nodes and 
edges. This is done by adding the relevant 
edges and orienting them using background 
knowledge (if any, see Section |3.2| > and the 
relevant causal paths. In addition we annotate 
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Fig. 5: Example stability graphs from an arti¬ 
ficial data set of 400 instances with six contin¬ 
uous variables, without prior knowledge, (a) 
Edge stability graph, (b) Causal path stability 
graph. Each line in (a) represents an edge be¬ 
tween a pair of variables and each line in (b) 
represents a causal path with any length from 
a variable to another variable. The threshold of 
selection probability, TTgei, is set to 0.6 and the 
threshold for model complexity, Tibic, is chosen 
to minimize the average BIC score. See the 
main text for more details. 


each edge with the highest selection probability 
it has across different model complexities in the 
top-left region of the edge stability. This visu¬ 
alization eases interpretation but the stability 
graphs are considered to be the main outcome 
of our approach. 

3.2 Constrained SEM 

In practise, one often has prior knowledge 
about the domain, for example, that A does 
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not cause B directly, denoted by ^4 74 B. 
The method proposed here can include such 
prior knowledge, extending previous work 
ll^ , since this translates to a DAG with no 
directed edge from A to sQ 

Model specifications should comply with 
any prior knowledge when performing spec¬ 
ification search and when measuring the edge 
and causal path stability. When DAGs are 
converted into CPDAGs in the outer loop, 
a constraint A B may be violated since 
arcs S —> A in the DAG may be converted 
into undirected (reversible) edges A — 5 in 
the CPDAG. In order to preserve constraints 
we therefore extended the efficient DAG-TO- 
CPDAG algorithm of l[T4| which runs in time 
0{\E\) given a DAG G = {V, E). 

Figure provides pseudocode for the con¬ 
strained DAG to CPDAG algorithm. Line 2 
produces a total ordering E' over the edges 
in DAG G. Lines 3-6 impose an arc upon the 
edges that match the constraints. Finally, Line 7 
uses [141 to label the remaining edges E\E' in 
G with "compelled" or "reversible" and returns 
the constrained CPDAG G'. 


A DAG without edges will always be trans¬ 
formed into a CPDAG without edges. A fully 
cormected DAG without constraints will be 
transformed into a CPDAG with only undi¬ 
rected edges. However, if background knowl¬ 
edge is added, a fully connected DAG will be 
transformed into a CPDAG in which the edges 
corresponding to the background knowledge 
are directed. From these observations it follows 
that in the edge stability graph all paths start 
with a selection probability of 0 and end up 
in a selection probability of 1. In the causal 
path stability graph when no prior knowledge 
has been added all paths start with a selection 
probability of 0 and end up in a selection prob¬ 
ability of 0. However, when prior knowledge 
is added some of the paths may end up in a 
selection probability of 1 because of the added 
constraints. 


1. This still allows for directed edges from B io A or indirect 
relations from A io B. 


3.3 stable Specification Search Algorithm 

Figure [^provides pseudocode for our approach 
(cf. Figure |^. Lines 3-18 represent the outer 
loop. Lines 6-16 represent the inner loops. 
Lines 19-21 compute stability graphs. 

An inner loop (Lines 6-16) starts by forming 
a population P of size N, initially at ran¬ 
dom, or else from a previous population using 
crowding distance sorting (Lines 7-12). Models 
are represented with a binary vector y with 
yi G (0,1} denoting the existence of some arc 
A ^ B. Line 13 forms a new population Q 
by manipulating P using binary tournament 
selection, one-point crossover, and one-bit flip 
mutation, which are compatible with a binary 
representation. The selection scheme selects N 
times two models from P and places the best 
model (i.e., lowest front or else smallest crowd¬ 
ing distance) in a mating pool Mpooi- One- 
point crossover takes two models from Mpooi 
and swaps the data after the crossover point 
(the middle). One-bit flip mutation flips each 
bit according to a predetermined rate. Line 14 
combines P and Q and sorts them using fast 
non-dominated sorting. Line 15 updates the 
Pareto front in Pi. 

An outer loop (Lines 3-18) randomly sam¬ 
ples a subset T from D with size [|D|/2J 
(Line 4), runs the inner loop / times to obtain 
a Pareto front (Lines 6-16), and stores it in 
H (Line 17). After J iterations, H contains J 
Pareto fronts. 

Lines 19-21 convert the J Pareto fronts in 
H from DAGs into CPDAGs using the algo¬ 
rithm in Figure and then computes the edge 
and causal path stability graphs. The stability 
graphs are considered to be the main outcome 
of our approach, but can also be visualized as 
a graph with nodes and edges. 

4 Experimental Study 

We implemented the stable specification search 
as an R package named stablespec. The 
package is publicly available at the Compre¬ 
hensive R Archive Network (CRAN)|^ so it can 
be installed directly, e.g., from R console by 

2. https://cran.r-project.org/web/packages/stablespec/ 
index.html 
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function consDag2Cpdag(DAG G, constraint C) 

E' ^ orderEdges(G) 
for every constraint c e C do 
get e e E' that matches c 

label e with "compelled" in the direction consistent with c 

end for 

return G' ^ labelEdges(G, E') t> label remaining edges using 114| 

end function 


Fig. 6: The constrained DAG-TO-CPDAG algorithm returns a CPDAG which is consistent with 
the added prior knowledge and extends pA| . The algorithm first labels the edges that match 
the constraints with "compelled" and then labels the remaining edges with "reversible" or 
"compelled" using [[T4|. 


1 

procedure stableSpecificationSearch(data set D, 

constraint C) 

2 

H^O 

> initialize 

3 

for jA— 0,..., J — 1 do 

> J is number of outer loop iterations 

4 

T ^ subset of D with size [ i9|/2j without replacement 

5 


0 initialize Pareto fronts to empty list 

6 

for i ^ 0,— 1 do 

> / is number of inner loop iterations 

7 

if i = 0 then 

8 

P A random DAGs consistent with C 

9 

P fastNonDominatedSort(P) 


10 

else 


11 

P crowdingDistanceSort(F) 

> draw the first N models 

12 

end if 


13 

Q ^ make population from P 


14 

F ^ fastNonDominatedSort(P'^Q) 


15 

El ^ pareto front of F and Fi 


16 

end for 


17 

H ^ H^Fi 

0 concatenation 

18 

end for 


19 

G consDag2Cpdag(iF, C) 


20 

edges ^ edge stability of G 


21 

paths path stability of G 


22 

end procedure 



Fig. 7: Stable specification search consists of an outer and an inner loop. The outer loop samples a 
subset of the data, and for every subset, the inner loop searches for the Pareto front by applying 
NSGA-II. The Pareto fronts are converted into constrained CPDAGs which are then used to 
compute the edge and causal path stability graph. 


typing install.package("stablespec") 

or from RStudio by using feature to install 
package. We also included a package documen¬ 
tation as a brief tutorial of using the functions. 
All experiments were run on an Intel Xeon E7- 
4870 v2 Processor 2.3 GHz, 15 Core, 96 of 32GB 
LRDIMM. 


4.1 Parameter Settings 

For all experiments, we employed the same 
set of NSGA-Il parameters and stability 
thresholds. We had 100 iterations in the outer 
loop, and in each iteration we drew a sub¬ 
sample with size [|i9|/2j. We did not do a 
comprehensive parameter tuning for NSGA-II, 











9 


instead, we followed guidelines provided in 
p4|. The parameters were set as follows: the 
number of generations (irmer loop) was 20, the 
size of the population P was 100, the crossover 
rate was 0.85, the mutation rate was 0.075 and 
with binary tournament selection. 

We score models using the chi-square and 
the model complexity. The is considered the 
original fit index in SEM and measures whether 
the model-implied covariance matrix is close 
enough to the sample covariance matrix |[35|. 

The model complexity represents how many 
predicted parameters the model contains. As¬ 
suming that variances of parameters are al¬ 
ways predicted, the maximum model complex¬ 
ity with n variables is given by n(n — l)/2. 

When using multi-objective optimization we 
minimize both the x^ model complexity 
objectives. These two objectives are, however, 
conflicting with each other. For example, min¬ 
imizing the model complexity typically means 
compromising the data fit. 


4.2 Application to Simulated Data 

4.2.1 Data Generation 


In this experiment we generated data using the 
Waste Incinerator network in Figure]^ which is 
a model of waste emissions from an incinerator 
plant p^ . This model contains both discrete 
and continuous random variables, with B the 
waste burning regimen, W the compositional 
differences in incoming waste, C the concen¬ 
tration of C02, F the filter state, E the filter 
efficiency, L fhe lighf penetrability, D the emis¬ 
sion of dust, Min the metals in waste, and Mont 
the metals emission. Following p7| , we treat 
all discrete variables as continuous. We added 
prior knowledge that none of the variables 
directly cause the filter state. We generated 10 
data sets containing 400 samples from this net¬ 
work using the BNT toolbox with the default 
parameter setting as described in ||^|. 


4.2.2 Performance Measure 
We compared the stable specification search 
with GES (score-based method), PC-stable, and 
CPC (both constrained-based methods). Our 
method intrinsically subdivides the data in a 
number of subsets, here 50 of size 200 samples. 



Fig. 8: The Waste Incinerator network. Rectan¬ 
gular nodes represent discrete variables, oval 
nodes represent continuous variables, and arcs 
represent direct causal relations. 


and then runs the multi-objective optimization 
to obtain 50 Pareto fronts (see Section |2.4[ >. 
For a fair comparison, for each algorithm we 
consider subsampling (e.g., p^), giving each 
method 50 subsets. For every subset, each al¬ 
gorithm returns a CPDAG from which we can 
derive fhe edges and causal paths. 

Since the true model of the Waste Inciner¬ 
ator data is known, we can measure the per¬ 
formance of bofh methods by means of fhe 
Receiver Operating Characteristic (ROC) curve 
l |40| . The True Positive Rate (TPR) and the False 
Positive Rate (FPR) are computed with respect 
to the CPDAG of the true model. For example, 
in the case of causal path stability, a true posi¬ 
tive means that a causal path with any length 
obtained through our approach or the PC al¬ 
gorithm is actually present in the CPDAG of 
the true model. By increasing the threshold vTsei, 
we increase the TPR at the expense of the FPR. 
In addition, we conducted three significance 
tests to compare the ROC curves. The first test 
|41| compares the Area Under the Curve (AUC) 
of the ROC curves based on the theory of U- 
statistics. The second test 1 |42| |, a modification 
of [431, compares the AUC of ROC curves 
that are generated from bootstrap replicates. 
The third test p4| compares the actual ROC 
curves by evaluating the absolute difference. 
The null hypothesis is that the AUC of the ROC 
curves of our method and the PC algorithm are 
equivalent. 

We repeated the above procedure 10 times 
on different Waste Incinerator data sets and 
computed the ROC curves using two different 
schemes: averaging and individual. In the av¬ 
eraging scheme, the ROC curves are computed 
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based on the average edge and causal path 
stability from different data sets. We conducted 
statistical significance tests on these average 
ROC curves. Conversely, in the individual 
scheme the ROC curves are computed directly 
from the edge and causal path stability on each 
data set. We conducted individual statistical 
significance tests on the ROC curves for each 
data set and then used Fisher's method, as de¬ 
scribed in 


1 |46| |, to combine these tests into 
a single test statistic. Both schemes are intended 
to show empirically and comprehensively how 
robust the results of each algorithm are across 
changes in the data. 


4.2.3 Discussion of Waste Incinerator Result 

Figure shows the ROC curves for (a) the 
edge stability and (b) the causal path stability 
from the averaging scheme. The corresponding 
AUCs for edge stability are 0.96 (stable specifi¬ 
cation search), 0.89 (PC-stable), 0.88 (CPC), and 
0.69 (GES). The AUCs for causal path stability 
are 0.98 (stable specification search), 0.85 (PC- 
stable), 0.88 (CPC), and 0.61 (GES). 

Table lists the results of the significance 
tests for both the averaging and individual 
schemes. The ROC and AUC for the edge 
stability are comparable with PC-stable and 
CPC (p-value > 0.1), but always significant 
(p-value < 0.01) compared with GES. The 
ROC and AUC for the causal path stability 
compared with CPC are marginally significant 
(p-value < 0.1) using the averaging scheme, 
but significant using the individual scheme (p- 
value < 0.01); compared with PC-stable sig¬ 
nificant (p-value < 0.05) using the averaging 
scheme, but highly significant using the in¬ 
dividual scheme (p-value < 10“®); compared 
with GES highly significant using both schemes 
(p-value < 10“®). To conclude, we show that 
the stable specification search obtains at least 
comparable performance as, but often signifi¬ 
cant improvement over alternative approaches, 
especially in obtaining the causal relations. 


4.3 Application to Real-world Data 

This section describes the results of applying 
our proposed method on two real-world data 
sets. Both of them are about particular diseases. 



FPR 

(a) 



Eig. 9: ROC curves for (a) the edge stability and 
(b) the causal path stability, for different values 
of TTsei in the range of [0,1]- In (a), the AUCs 
are 0.96 (stable specification search), 0.89 (PC- 
stable), 0.88 (CPC), and 0.69 (GES). In (b). The 
AUCs are 0.98 (stable specification search), 0.85 
(PC-stable), 0.88 (CPC), and 0.61 (GES). 


for which the underlying causal relationships 
are often not clear. Revealing such causal rela¬ 
tionships can lead to the development of (new) 
dedicated treatments and medications. Here, 
we consider data on Attention Deficit Hyper¬ 
activity Disorder (ADHD) and Chronic Fatigue 
Syndrome (CES). 

4.3.1 Performance Measure 

Since the true model is unknown we measure 
the performance of our method using the edge 
stability and causal path stability graphs. We 
set the thresholds to vTsei = 0.6 and vTbic to 
the minimum average of BIC scores. The rel- 
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TABLE 1: Table of p-values from comparisons between stable specification search and alternative 
approaches. For each significance test, we compared the ROC of the edge (Edge) and causal path 
(Causal) stability on both averaging (Ave.) and individual (Ind.) schemes. 



GES 

PC-stable 

CPC 

Significance test 

Ave. 

Ind. 

Ave. 

Ind. 

Ave. 

Ind. 

DeLong |41 

1 

Edge 

Causal 

0.003 
< 10-^ 

< 10-"' 
< 10-® 

0.317 

0.027 

0.175 
< 10-5 

0.284 

0.073 

0.131 

< 10-5 

Bootstrap [42 

^1 

Edge 

Causal 

0.003 
< 10-^ 

< 10-^ 
< 10-® 

0.296 

0.022 

0.135 
< 10-5 

0.261 

0.064 

0.098 
< 10-5 

Venkatraman [ 

441 

Edge 

Causal 

0.004 
< 10-^ 

< 10-^ 
< 10-5 

0.591 

0.023 

0.684 
< 10-5 

0.539 

0.096 

0.592 

0.005 


evant causal relations are those which occur in 
the top-left region (see Figure as example). 
We compare the stability graphs to studies 
reported in the literature. 

4.3.2 AfDfDlication to CFS 

In this experiment we consider a data set about 
Chronic Fatigue Syndrome (CFS) of 183 subjects 
p7| . Originally the data comes from a longi¬ 
tudinal study with five time slices, but in this 
paper, we focus only on one time slice repre¬ 
senting the subjects after the first treatment. 

The data set contains six discrete variables; 
fatigue severity assessed with the subscale 
fatigue severity of the Checklist Individual 
Strength (CIS), the sense of control over fa¬ 
tigue assessed with the self-efficacy scale (SES), 
focusing on symptoms measured with the Ill¬ 
ness Management Questionnaire, the objective ac¬ 
tivity of the patient measured using an actome- 
ter (oActivity), the subject's perceived activity 
measured with the subscale activity of the CIS 
(pActivity), and physical functioning measured 
with subscale physical functioning of the medi¬ 
cal outcomes survey (SF36). We refer to the origi¬ 
nal paper p7| , for a detailed description of the 
questionnaires used and the actometer. Miss¬ 
ing values were imputed using an imputation 
method Expectation Maximization implemented 
in SPSS | |48| . As all of the variables have large 
scales, e.g., in the range between 0 to 155, we 
treat them as continuous variables. We added 
prior knowledge that the variable fatigue does 
not cause any of the other variables directly. 

The total computation time for one subset 
was around 5.5 minutes. Figure [T^ shows that 


eight relevant edges were found. These edges 
are between pActivity and fatigue, focusing and 
fatigue, functioning and fatigne, control and 
fatigue, pActivity and focusing, pActivity and 
oActivity, focusing and control, and functioning 
and control. 


10 


Figure 

paths were 
are: pActivity 


shows that four relevant causal 
found. These causal paths 
to fatigue, control to fatigne. 


fnnctioning to fatigne, and focnsing to fatigue. 


The stability graphs can be combined into a 
model as follows. First, the nodes are connected 
according to the eight relevant edges obtained. 
Second, the edges are oriented according to the 
background knowledge added. The fact that 
the variable fatigue does not directly cause any 
other variable results in four directed edges, 
which, in this case, correspond exactly to the 
relevant causal paths obtained. The inferred 
model is shown in Figure 


A (direct) causal path X ^ Y in Figure 
indicates that a change in variable X causes 
a change in variable Y. All variables except 
for objective activity were found to be direct 
causes for fatigue severity, which are corrob¬ 
orated by literature studies. In p^ , changes 
in physical activity, sense of control, and focus 
on symptoms measured, were shown to result 
in changes in fatigue. In [ISOl, changes in per¬ 
ceived activity, sense of control, and physical 
functioning were shown to result in changes in 
fatigue. In p7| , an increase in sense of control, 
perceived activity, and self-reported physical 
functioning, as well as a decrease in focusing 
on symptoms resulted in a decrease of fatigue, 
whereas changes in objective activity did not 
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Fig. 10: The stability graphs for CFS together 
with TTsei and TTbic, yielding four regions. The 
fop-left region is the area containing the rel¬ 
evant causal relations, (a) The edge stability 
graph showing eight relevant edges, (b) The 
causal path stability graph showing four rel- 
evanf causal paths. See Tables and in Ap¬ 
pendix 1^ for more detail. 


result in any change in fatigue. 


4.3.3 Application to ADHD 

In this experiment we consider a data set about 
Attention Deficit Hyperactivity Disorder (ADHD) 
of 245 subjects with 23 variables | [5T| . Following 
[521, we excluded instances with missing val¬ 
ues and variables that either have insufficient 
instances or are considered irrelevant. The re¬ 
maining data set consists of 221 instances and 
six variables with gender the gender of subjects, 
AD the attention deficit measure, HI the assess¬ 
ment of hyperactivity/impulsivity symptoms, 
aggression the measure of aggressive behavior. 



Fig. 11: The inferred model of CFS by combin¬ 
ing fhe edge stability and causal path stability 
graphs. Each edge has a reliability score which 
is the highest selection probability in the top- 
left region of the edge stability graph. 


medication the medication status of subjects, 
and handedness represents whether a subject 
uses the right and/or left hand. Following p7| , 
we treat all discrete variables as continuous 
variables. We added prior knowledge that the 
variable gender does not cause any of the other 
variables directly. 

The total computation time for one subset 
was around 4.9 minutes. Figure shows that 
there are four relevant edges, namely between 
gender and AD, AD and medication, AD and 


HI, and HI and aggression. Moreover, Figure 12 
shows that there are seven relevant causal 
paths; gender to AD, gender to HI, gender to 
medication, gender to aggression, AD to HI, AD 
to medication, and AD to aggression. 


The stability graphs can be combined into a 
model as follows. First, the nodes are connected 
according to the four relevant edges obtained. 
Second, the edges are oriented according to 
the background knowledge added. The fact 
that the variable gender does not directly cause 
any other variable results in one directed edge 
gender ^ AD. Third, the edges are oriented 
according to the relevant causal paths obtained. 
This results in two directed edges, AD — )■ HI 
and AD medication. Since there is no rele¬ 
vant edge between AD and aggression and no 
relevant causal path from HI to aggression we 
cannot orient any other edges and therefore 
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Fig. 12: The stability graphs for ADHD together 
with TTsei and TTbic, yielding four regions. The 
top-left region is the area containing the rel¬ 
evant causal relations, (a) The edge stability 
graph showing four relevant edges, (b) The 
causal path stability graph showing seven rel¬ 
evant causal paths. See Tables and in Ap¬ 
pendix 1^ for more detail. 


carmot represent two of the relevant causal 
paths in the model. We loose some information 
when converting the stability graphs into a 
model. The inferred model is shown in Fig¬ 
ure [m 


The causal relations obtained for ADHD are 
corroborated by studies reported in the liter¬ 
ature. In ||52|, gender is shown to be a direct 
cause for attention deficit, attention deficit is 
shown to be a direct cause for both hyperac¬ 
tivity, medication, and aggression, and hyper¬ 
activity and aggression are related but neither 
variable is a direct cause for the other. 



Fig. 13: The inferred model of ADHD by com¬ 
bining the edge stability and causal path sta¬ 
bility graphs. Each edge has a reliability score 
which is the highest selection probability in the 
top-left region of the edge stability graph 


5 Conclusion and Future Work 

In the last decades the field of causal modeling 
has seen a surge in theoretical development 
and the construction of various causal discov¬ 
ery algorithms. In general, causal discovery 
algorithms can be divided into two approaches: 
constraint-based and score-based. A disadvan¬ 
tage, however, of current causal discovery al¬ 
gorithms is the inherent instability of structure 
estimation. With finite samples small changes 
in the data can lead to completely different 
optimal structures. 

The present work introduces a new 
hypothesis-free score-based causal discovery 
algorithm, stable specification search, that is 
robust for finite samples based on subsampling 
and selection algorithms. Our approach 
uses exploratory search to search over 
Structural Equation Models and allows for the 
incorporation of prior background knowledge, 
without the need to specify the complete 
model structure in advance. 

The comparison conducted on the simulated 
data shows that our method, the stable specifi¬ 
cation search, shows significant improvement 
over alternative approaches in obtaining the 
causal relations. The results on both real-world 
data sets, CFS and ADHD, are consistent with 
previous studies 1^, p^ , 1 [52| . In general 

we may conclude that our causal discovery 
algorithm is able to robustly estimate the un¬ 
derlying causal structure. 

Several issues have not yet been explored 
in our current approach that warrant further 
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research, such as latent variables and longitu¬ 
dinal data. Taking into account the existence of 
latent variables can further improve our struc¬ 
ture estimate by properly identifying depen¬ 
dencies between variables as an unmeasured 
common cause acting on both variables. In 
longitudinal data several subjects are measured 
at different time slices which provides a richer 
structure that can be incorporated in the causal 
discovery algorithm. A first attempt in this 
direction can be found in 

Our approach can be viewed as a novel ap¬ 
plication of multi-objective optimization. The 
main idea of stability selection |31j, is to in¬ 
crease the robustness of structure estimation 
by considering a whole range of model com¬ 
plexities. In the original work, this is done by 
varying a continuous regularization parame¬ 
ter. For causal discovery we have to explicitly 
consider different discrete model complexities. 
Furthermore, finding the optimal structure for 
each model complexity is a hard optimization 
problem. By rephrasing stability selection as a 
multi-objective optimization problem, we can 
jointly run over various model complexities 
and find the corresponding optimal structures 
for each model complexity. In this paper, we 
have used NSGA-II for multi-objective opti¬ 
mization, because of its popularity and avail¬ 
ability, but realize that more recent multi¬ 
objective optimization approaches |[54|, |55|, 
l^ , 1^] may be even more efficient. This is be¬ 
yond the scope of this work and left for future 
research. In the same spirit, one can easily com¬ 
bine freely available software packages, e.g., for 
scoring Structural Equation Models, bootstrap 
sampling, and multi-objective optimization, to 
build one's own robust structural estimation 
approach. 
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TABLE 2: Edge stability of CES TABLE 4: Edge stability of ADHD 


o- - 
A- - 
+ -- 

0 -- 

o- - 
A- - 
+ - - 
X- - 

O- - 

o- - 
A- - 


Lines 

- - □ — 

- - o- 

- - A- 

-- +--- 

-- o- 

- - □- 

- - o- 

- - A- 

-- +- 

- - X- 

- - O- 

- - □- 

- - o- 

- - A- 


- - O 

- - A 
-- + 

-- o 

- - o 

- - A 
-- + 

- - X 

- - O 

- - o 

- - A 


Edges 

fatigue and pActivity 
fatigue and control 
control and focusing 
fatigue and functioning 
control and functioning 
focusing and pActivity 
fatigue and focusing 
pActivity and oActivity 
functioning and pActivity 
control and pActivity 
control and oActivity 
focusing and oActivity 
fatigue and oActivity 
fucntioning and oActivity 
functioning and focusing 


o- - 

X- - 

o- - 
+- - 
A- - 

o- - 
O- - 
A- - 
X- - 
A- - 
+ -- 
O- - 


Lines 

- - o— 

- - □ — 

- - X- 

- - □- 

- - o- 

- - +— 

-- A- 

- - □- 

- - O- 

-- o--- 

- - A- 

- - X- 

- - A- 

- - +- 

- - O- 


- - O 

- - X 

- - o 

- - + 
-- A 

- - O 

- - O 

- - A 

- - X 

- - A 
-- + 

- - O 


Edges 
AD and HI 
AD and medication 
HI and aggression 
gender and AD 
aggression and AD 
AD and medication 
handedness and aggression 
medication and aggression 
gender and HI 
HI and handedness 
gender and medication 
gender and handedness 
AD and handedness 
gender and aggression 
medication and handedness 


TABLE 3: Causal path stability of CES TABLE 5: Causal path stability of ADHD 
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